Introduction to R and Rstudio

Session - joins

Zoë Turner

Relational data

It’s rare to find all the data you need for an analysis in a single table.

Typically, you’ll have to link two (or more) tables together by matching on common “key” variable(s).

We use joins in SQL or R, (V, H or X) LOOKUPs in Excel

Joins

All the joins in SQL are available in R and a couple more called filtering joins which have the same concept as EXISTS used in the WHERE clause in SQL.

left_join()

Keep structure of table x

…and match to observations in table y

“key” variable (common to both tables)

x |> 
  left_join(y, by = "id")
Moving gif that shows two tables and their merge with the data on the left 1, 2 and 3 being added to from the right where it matches

source: https://github.com/gadenbuie/tidyexplain

Have a go!

We’re going to join two tables one with cases of tuberculosis by country, one with population by country.

Import the files:

tb_cases.csv tb_pop.csv tb_new_table.csv

08:00

Join 2 tables together

When columns have the same name R resolves this conflict by adding .x and .y to the end of the column name

tb_cases |> 
  left_join(tb_pop, by = "country")
# A tibble: 64 × 5
   country     year.x  cases year.y        pop
   <chr>        <dbl>  <dbl>  <dbl>      <dbl>
 1 Afghanistan   1999    745   1999   19987071
 2 Afghanistan   1999    745   2000   20595360
 3 Afghanistan   1999    745   2001   21347782
 4 Afghanistan   1999    745   2002   22202806
 5 Brazil        1999  37737   1999  172006362
 6 Brazil        1999  37737   2000  174504898
 7 Brazil        1999  37737   2001  176968205
 8 Brazil        1999  37737   2002  179393768
 9 China         1999 212258   1999 1272915272
10 China         1999 212258   2000 1280428583
# … with 54 more rows

Join on multiple columns

Explicitly joining on more than one column requires the use of a vector

tb_cases |> 
  left_join(tb_pop, by = c("country", "year"))

Vectors

You can create a vector with function: c() for concatenate/combine

c("beds", "staff", "patients")

Default combine

{dplyr} joins can occur automatically

tb_cases |> 
  left_join(tb_pop)
# A tibble: 16 × 4
   country      year  cases        pop
   <chr>       <dbl>  <dbl>      <dbl>
 1 Afghanistan  1999    745   19987071
 2 Brazil       1999  37737  172006362
 3 China        1999 212258 1272915272
 4 Denmark      1999    170    5319410
 5 Afghanistan  2000   2666   20595360
 6 Brazil       2000  80488  174504898
 7 China        2000 213766 1280428583
 8 Denmark      2000    171    5338283
 9 Afghanistan  2001   4639   21347782
10 Brazil       2001  37491  176968205
11 China        2001 212766 1287890449
12 Denmark      2001    124    5354684
13 Afghanistan  2002   6509   22202806
14 Brazil       2002  40723  179393768
15 China        2002 194972 1295322020
16 Denmark      2002    135    5368994

Joining with different names

If two tables have different names for same variable:

tb_cases |> 
  left_join(tb_new_table, by = c("country" = "Place",
                                 "year" = "Yr"))

semi_join()

Semi-joins retain the data on the left if it matches the right, but drops the right (so is not like an inner join that keeps both sides).

All rows from x where there are matching values in y, keeping just columns from x.

Moving gif showing the data from the left table being kept if it matches the right but everything is dropped from the right

Use Case - hospital Covid tests

Finding hospital patients who have had a Covid-19 test but only bring back the information on the hospital, nothing about the test.

Join to the tb_new_table and bring back only those records where the column FirstLetter is ‘A’ but drop the data from tb_new_table.

tb_cases |> 
  semi_join(tb_new_table |> 
              filter(FirstLetter == "A"), by = c("country" = "Place",
                                                 "year" = "Yr"))
# A tibble: 4 × 3
  country      year cases
  <chr>       <dbl> <dbl>
1 Afghanistan  1999   745
2 Afghanistan  2000  2666
3 Afghanistan  2001  4639
4 Afghanistan  2002  6509

anti_join()

All rows from x where there are not matching values in y, keeping just columns from x.

Moving gif that only retrains from the left table what doesn't match on the right

Use Case - text mining stop words

In text mining, exclude ‘stop words’ (but, and, or) which are kept in a separate table for reference.

Join to the tb_new_table and exclude only those records where the column FirstLetter is ‘A’ .

tb_cases |> 
  anti_join(tb_new_table |> 
              filter(FirstLetter == "A"), by = c("country" = "Place",
                                                 "year" = "Yr"))
# A tibble: 12 × 3
   country  year  cases
   <chr>   <dbl>  <dbl>
 1 Brazil   1999  37737
 2 China    1999 212258
 3 Denmark  1999    170
 4 Brazil   2000  80488
 5 China    2000 213766
 6 Denmark  2000    171
 7 Brazil   2001  37491
 8 China    2001 212766
 9 Denmark  2001    124
10 Brazil   2002  40723
11 China    2002 194972
12 Denmark  2002    135

End session